Business Analytics

Linear Regression

Ayush Patel

22 January, 2024

Prerequisites

You already…

  • Know the basics of data wrangling in R
  • Know the basics of data visualization in R
  • Have a working knowledge of basic statistics

Before we begin

Please install and load the following packages:

library(tidyverse)
library(MASS)
library(openintro)
library(palmerpenguins)



Access the lecture slides from the course landing page

About Me

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at the Gokhale Institute of Politics and Economics.

I am an RStudio (Posit) certified tidyverse instructor.

I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI) at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Central Tendency - Refresher

Content for this topic has been sourced from Danielle Navarro’s ‘Learning statistics with R’. Please check out her work for detailed information.

  • To get a ‘gist’ of the data, it is useful to condense it into ‘summary’ statistics
  • Often, the measures of central tendency (mean, median, mode) are used
  • In simple words
    • Mean - the average; the sum of all values divided by the number of observations
    • Median - the middle value of the sorted observations
    • Mode - the value that occurs most frequently
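A quick sketch of all three in base R (the vector x is made up for illustration; note that base R has no built-in mode function, so stat_mode below is our own small helper, not a standard function):

```r
x <- c(2, 3, 3, 5, 7, 3, 9)

mean(x)    # sum of values divided by their count: about 4.57
median(x)  # middle value of the sorted data: 3

# most frequent value (returns the first mode in case of ties)
stat_mode <- function(v) {
  tab <- table(v)
  as.numeric(names(tab)[which.max(tab)])
}
stat_mode(x)  # 3 occurs most often
```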

Central Tendency - Refresher

Content for this topic has been sourced from Danielle Navarro’s ‘Learning statistics with R’. Please check out her work for detailed information.

Ok, but what do these measures imply?

  • The mean is affected by outliers
  • The median is not affected by outliers
  • Example - the mean salary in India is higher than the median salary, because a few very high earners pull the mean up; the same holds for house prices
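A tiny illustration of this in R, with made-up salary figures:

```r
# one extreme salary pulls the mean far above the median
salaries <- c(30, 35, 40, 45, 1000)  # in thousands; values are invented

mean(salaries)    # 230 -- dragged up by the outlier
median(salaries)  # 40  -- unaffected by the outlier
```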

Variance and Correlation - Refresher

Content for this topic has been sourced from Danielle Navarro’s ‘Learning statistics with R’. Please check out her work for detailed information.

  • Central tendency shows the middle and ‘popular’ points of the data
  • Variance, by contrast, shows how spread out the values are from the average
  • Variability, or spread, can be measured by
    • Range - the difference between the largest and smallest value
    • Interquartile Range - the difference between the 75th and 25th percentiles
    • Variance - the average squared deviation from the mean
  • To see the relationship between two variables, we check their correlation
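Each of these measures is a one-liner in base R. A sketch on made-up vectors (note that var() divides by n − 1, the sample variance, rather than by n):

```r
x <- c(2, 4, 6, 8, 10)

diff(range(x))  # range: largest minus smallest = 8
IQR(x)          # interquartile range = 4
var(x)          # sample variance (divides by n - 1) = 10

# correlation measures the strength of a linear relationship
y <- c(1, 3, 5, 7, 9)
cor(x, y)       # exactly 1: a perfect positive linear relationship
```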

What is Linear Regression?

Content for this topic has been sourced from the book ‘An Introduction to Statistical Learning with applications in R’. Please check out the book for detailed information.

  • A linear model can help us answer questions about the association between response and predictors, predict future values of the response, assess whether the relationship is linear, and detect interactions between predictors
  • Linear regression, in simple words, is useful for predicting a quantitative response
  • Linear regression with one variable predicts a quantitative response Y on the basis of a single predictor variable X

Prediction with linear regression

Content for this topic has been sourced from the book ‘Introduction to Modern Statistics’. Please check out the book for detailed information.

  • Use the possum dataset from the openintro package
  • How can we predict head length from total length?
  • Scatterplot of total length and head length
possum %>%
  ggplot(aes(x = total_l, y = head_l)) +
  geom_point() +
  theme_light()

Prediction with linear regression

Content for this topic has been sourced from the book ‘Introduction to Modern Statistics’. Please check out the book for detailed information.

  • Independent or Predictor Variable (x) - Total length
  • Dependent or Response Variable (y) - Head Length
possum %>%
  ggplot(aes(x = total_l, y = head_l)) +
  geom_point() +
  geom_smooth(
    method = "lm" # the lm method fits a linear model for the trendline
  ) +
  theme_light()

Do it Yourself - 1

  • Use the husbands_wives data from openintro
  • Make a scatterplot of the ages of husbands and wives
  • To the previous scatterplot, add a fitted trendline
  • What is the relationship between the two variables like?
  • Repeat points 2-4, but for the heights of husbands and wives

Prediction with linear regression - Linear Regression Equation

lm(head_l ~ total_l, data = possum)

Call:
lm(formula = head_l ~ total_l, data = possum)

Coefficients:
(Intercept)      total_l  
    42.7098       0.5729  
  • The equation can be written as \[\hat{y} = 42.71 + 0.57x\]
  • We will learn the intuition behind this model soon
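Using the coefficients reported above, a prediction for a hypothetical possum with a total length of 85 cm (an illustrative value, not from the slides) works out as:

```r
# hand computation with the fitted coefficients from the model above
b0 <- 42.7098
b1 <- 0.5729
yhat <- b0 + b1 * 85
yhat  # about 91.41 (predicted head length, in mm)

# the same prediction via the model object would be:
# fit <- lm(head_l ~ total_l, data = possum)
# predict(fit, newdata = data.frame(total_l = 85))
```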

Residual

Content for this topic has been sourced from the book ‘Introduction to Modern Statistics’. Please check out the book for detailed information.

  • Leftover variation that the model does not account for
  • Data = Fit + Residual

Source : Introduction to Modern Statistics
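The decomposition Data = Fit + Residual can be checked directly in R on a small made-up dataset:

```r
# four invented (x, y) pairs, roughly on the line y = 2x
d <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))
fit <- lm(y ~ x, data = d)

# every observed value splits exactly into fitted value + residual
all.equal(d$y, unname(fitted(fit) + residuals(fit)))  # TRUE
```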

Writing a Linear Regression Equation

A general equation for linear model can be written as \[ Y \approx \beta_0 + \beta_1X \]

\[\beta_0 \text{ is the population intercept}\]

\[\beta_1 \text{ is the population slope}\]

Our estimates are represented as:

\[\hat\beta_0 \text{ and } \hat\beta_1\]

Do it Yourself - 2

  • Using the husbands_wives dataset, find the regression equation for the relationship between the ages of husbands and wives
  • Using the teacher dataset, find the relationship between the amount paid to FICA and the base salary of teachers

Least Squares Regression

Content for this topic has been sourced from the book ‘Introduction to Modern Statistics’. Please check out the book for detailed information.

  • It is a rigorous approach to fitting a line to a scatterplot
  • Use the elmhurst data from openintro package
elmhurst %>%
  ggplot(aes(x = family_income, y = gift_aid))+
  geom_point() +
  theme_light()

What is the best estimate

Content for this topic has been sourced from the book ‘Introduction to Modern Statistics’. Please check out the book for detailed information.

  • The idea is to, essentially, draw a line through the points such that distance of every point from line is as small a possible i.e. least residuals
  • The dashed line represents the line that minimizes the sum of the absolute value of residuals, the solid line represents the line that minimizes the sum of squared residuals

Source : Introduction to Modern Statistics

Least Square Line

Content for this topic has been sourced from the book ‘Introduction to Modern Statistics’. Please check out the book for detailed information.

  • One way to get estimates of the population coefficients, or parameters, is to minimize the sum of squared residuals (least squares)
  • \[aid \approx \beta_0 + \beta_1 \times income\]

\[\hat y_i = \hat\beta_0 + \hat\beta_1x_i\] \[e_i = y_i - \hat y_i\]

\[RSS = e_1^2 + e_2^2 + \dots + e_n^2\]

Minimize RSS

Least square coefficient estimates

\[ \hat\beta_1 = \frac{\sum_{i=1}^n(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n(x_i - \bar x)^2} \]

\[ \hat\beta_0 = \bar y - \hat\beta_1\bar x \]
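These formulas can be verified against lm() on simulated data (the true intercept 3 and slope 2 below are arbitrary choices for the simulation):

```r
set.seed(1)
x <- rnorm(50)
y <- 3 + 2 * x + rnorm(50)

# least squares estimates computed directly from the formulas
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() produces exactly the same estimates
fit <- lm(y ~ x)
all.equal(c(b0, b1), unname(coef(fit)))  # TRUE
```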

Interpreting Regression Equation

  • The lm() function is used to fit linear models in R
  • model_name <- lm(response_variable ~ predictor_variable, data = dataset_name)
lm( gift_aid ~ family_income, data = elmhurst)

Call:
lm(formula = gift_aid ~ family_income, data = elmhurst)

Coefficients:
  (Intercept)  family_income  
     24.31933       -0.04307  

Interpreting Regression Equation

  • The intercept 24.3193 is the predicted gift aid when family income is 0; both variables are measured in $1000s, so this is $24,319
  • For every $1 increase in family income, the gift aid given to the student reduces by $0.0431
  • Equivalently - for an increase of $1000 in family income, gift aid reduces by $43.1

Do it Yourself - 3

  • From the previous regression equation using the teacher dataset, try to interpret the results from the intercept and the coefficient
  • Using the bdims dataset, find the regression equation for how the weight is related to height and interpret the results

Categorical Variable

  • mariokart data in openintro
  • cond variable specifies whether the game is new or used
mariokart %>%
  filter(total_pr <= 100) %>%
  ggplot(aes(x = cond, y = total_pr)) +
  geom_point() +
  theme_light()

Categorical Variable

Content for this topic has been sourced from the book ‘Introduction to Modern Statistics’. Please check out the book for detailed information.

mariokart <- mariokart %>%
  filter(total_pr <= 100)
lm(total_pr ~ cond, data = mariokart)

Call:
lm(formula = total_pr ~ cond, data = mariokart)

Coefficients:
(Intercept)     condused  
      53.77       -10.90  
  • New games sell on average at $53.77
  • On average, used games sell for $10.90 less than new ones
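With a single binary categorical predictor, the coefficients are just group means: the intercept is the mean of the reference level, and the slope coefficient is the difference in group means. A sketch with made-up prices:

```r
# invented selling prices for three new and three used items
price <- c(55, 52, 54, 43, 41, 45)
cond  <- factor(c("new", "new", "new", "used", "used", "used"))

fit <- lm(price ~ cond)
coef(fit)

# the intercept equals the mean price of the reference level ("new")...
mean(price[cond == "new"])
# ...and the condused coefficient equals the difference in group means
mean(price[cond == "used"]) - mean(price[cond == "new"])
```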

Do it Yourself - 4

  • Using the teacher dataset, see how the degree that the teacher has affects their base salaries
  • What do the results mean?
  • Using the census data, check how sex of the individual affects their personal income
  • What could be the possible reasons for the result

Linear Regression with Multiple Predictors

Content for this topic has been sourced from the book ‘Introduction to Modern Statistics’. Please check out the book for detailed information.

  • The outcome variable is dependent on various predictors
  • loans_full_schema dataset from openintro (referred to as loans in the models below)
# A tibble: 10,000 × 55
   emp_title        emp_length state homeownership annual_income verified_income
   <chr>                 <dbl> <fct> <fct>                 <dbl> <fct>          
 1 "global config …          3 NJ    MORTGAGE              90000 Verified       
 2 "warehouse offi…         10 HI    RENT                  40000 Not Verified   
 3 "assembly"                3 WI    RENT                  40000 Source Verified
 4 "customer servi…          1 PA    RENT                  30000 Not Verified   
 5 "security super…         10 CA    RENT                  35000 Verified       
 6 ""                       NA KY    OWN                   34000 Not Verified   
 7 "hr "                    10 MI    MORTGAGE              35000 Source Verified
 8 "police"                 10 AZ    MORTGAGE             110000 Source Verified
 9 "parts"                  10 NV    MORTGAGE              65000 Source Verified
10 "4th person"              3 IL    RENT                  30000 Not Verified   
# ℹ 9,990 more rows
# ℹ 49 more variables: debt_to_income <dbl>, annual_income_joint <dbl>,
#   verification_income_joint <fct>, debt_to_income_joint <dbl>,
#   delinq_2y <int>, months_since_last_delinq <int>,
#   earliest_credit_line <dbl>, inquiries_last_12m <int>,
#   total_credit_lines <int>, open_credit_lines <int>,
#   total_credit_limit <int>, total_credit_utilized <int>, …

Linear Regression with Multiple Predictors

Content for this topic has been sourced from the book ‘Introduction to Modern Statistics’. Please check out the book for detailed information.

lm(interest_rate ~ public_record_bankrupt, data = loans)

Call:
lm(formula = interest_rate ~ public_record_bankrupt, data = loans)

Coefficients:
           (Intercept)  public_record_bankrupt  
               12.3403                  0.7042  
  • Individuals with a bankruptcy record have an interest rate about 0.70 percentage points higher than those without one

Linear Regression with Multiple Predictors

Content for this topic has been sourced from the book ‘Introduction to Modern Statistics’. Please check out the book for detailed information.

  • There are categorical variables with more than two levels
lm(interest_rate ~ verified_income, data = loans)

Call:
lm(formula = interest_rate ~ verified_income, data = loans)

Coefficients:
                   (Intercept)  verified_incomeSource Verified  
                        11.099                           1.416  
       verified_incomeVerified  
                         3.254  
  • The results show the relative difference for each level of verified_income
  • The reference level represents the default level that other levels are measured against
  • When fitting a regression model with a categorical variable that has k levels, where k > 2, we get a coefficient for k − 1 of those levels
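The k − 1 dummy variables can be seen directly with model.matrix(); by default R takes the first factor level (alphabetically) as the reference:

```r
income <- factor(c("Not Verified", "Source Verified", "Verified"))

# one intercept column plus k - 1 = 2 dummy columns
colnames(model.matrix(~ income))

# relevel() changes which level acts as the reference
income2 <- relevel(income, ref = "Verified")
colnames(model.matrix(~ income2))
```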

Confounding Variables

Content for this topic has been sourced from the book ‘Introduction to Modern Statistics’. Please check out the book for detailed information.

  • Surprisingly, people with verified income have a higher interest rate. Why so?
  • There might be confounding variables
  • Maybe people who verified their income source in the first place did so because they had poor credit
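A simulated sketch of how a confounder can create this pattern (all numbers below are invented for illustration): suppose an unobserved credit-risk score drives both the decision to verify income and the interest rate charged.

```r
set.seed(42)
n <- 1000
risk     <- rnorm(n)                    # unobserved credit risk
verified <- rbinom(n, 1, plogis(risk))  # riskier borrowers verify more often
rate     <- 10 + 2 * risk + rnorm(n)    # verification itself has no effect

# ignoring the confounder, verification appears to raise the rate...
coef(lm(rate ~ verified))["verified"]
# ...but after controlling for risk, its coefficient shrinks towards zero
coef(lm(rate ~ verified + risk))["verified"]
```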

Linear Regression with Multiple Predictors

Content for this topic has been sourced from the book ‘Introduction to Modern Statistics’. Please check out the book for detailed information.

lm(interest_rate ~ verified_income + debt_to_income + public_record_bankrupt + term + credit_util + issue_month, data = loans)

Call:
lm(formula = interest_rate ~ verified_income + debt_to_income + 
    public_record_bankrupt + term + credit_util + issue_month, 
    data = loans)

Coefficients:
                   (Intercept)  verified_incomeSource Verified  
                       2.23430                         1.09980  
       verified_incomeVerified                  debt_to_income  
                       2.66796                         0.02276  
        public_record_bankrupt                            term  
                       0.48942                         0.15417  
                   credit_util             issue_monthJan-2018  
                       4.83832                         0.04826  
           issue_monthMar-2018  
                      -0.04700  

Do it Yourself - 5

  • Using the penguins dataset from the palmerpenguins package, find and interpret the regression equation relating the body mass of the penguins to their species
  • To the previous model, add the variables of sex, bill length, bill depth and island
  • Now try to interpret the results

Thank You :)